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Abstract 

Spectral kernel methods are techniques for transforming data into a coordinate system 
that efficiently reveals the geometric structure — in particular, the "connectivity" — of the data. 
These methods depend on certain tuning parameters. We analyze the dependence of the method 
on these tuning parameters. We focus on one particular technique — diffusion maps — but our 
analysis can be used for other methods as well. We identify the population quantities implicitly 
being estimated, we explain how these methods relate to classical kernel smoothing and we 
define an appropriate risk function for analyzing the estimators. We also show that, in some 
cases, fast rates of convergence are possible even in high dimensions. 
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Figure 1 : Synthetic data in original and diffusion coordinates 



1 Introduction 



There has been growing interest in spectral kernel methods such as spectral clustering (von Luxburg 



2007), Laplacian maps (Bel kin and Niyogi, 2003| ), Hessian maps ( Donoho and Grimes, 2003 [ ), and 



locally linear embeddings (Roweis and Saul, 2000[). The main idea behind these methods is that the 



geometry of a data set can be analyzed using certain operators and their corresponding eigenfunc- 
tions. These eigenfunctions describe the main variability of the data and often provide an efficient 
parameterization of the data. 

Figure [TJ shows an example. The left plot is a synthetic dataset consisting of a ring, a blob, and 
some uniform noise. The right plot shows the data in a new parameterization computed using the 
methods described in this paper. In this representation the data take the form of a cone. The data 
can be much simpler to deal with in the new parameterization. For example, a linear plane will 
easily separate the two clusters in this parameterization. In high-dimensional cases the reparame- 
terization leads to dimension reduction as well. Figure [2] shows an application to astronomy data. 
Each point in the low-dimensional embedding to the right represents a galaxy spectrum (a function 
that measures photon flux at more than 3000 different wavelengths). The results indicate that by 
analyzing only a few dominant eigenfunctions of this highly complex data set, one can capture the 
variability in redshift (a quantity related to the distance of a galaxy from the observer) very well. 

More generally, the central goal of spectral kernel methods can be described as follows: 

Find a transformation Z = such that the structure of the distribution P z is 
simpler than the structure of the distribution P x while preserving key geometric 
properties of P x - 

"Simpler" can mean lower dimensional but can be intepreted much more broadly as we shall see. 

These new methods of data reparameterization are more flexible than traditional methods such 
as principal component analysis, clustering and kernel smoothing. Applications of these methods 



include: manifold learning, (Bickel and Levina, 2004), fast internet web searches (Page et al., 



Diffusion Map 




Figure 2: Left: Flux versus wavelength for a typical Sloan Digital Sky Survey (SDSS) galaxy spectrum. Right: 
Embedding of a sample of 2,793 SDSS galaxy spectra using the first 3 diffusion map coordinates. The color codes for 
redshift. The reparameterization shows a clear correspondence with variations in redshift, even though redshift was 
not taken into account in the construction. (Reproduced from Richards et al. (2009 )) 



1998), semi- supervised learning for regression and classification (Szummer and Jaakkola, 2001 



Lafferty and Wasserman, 2007), inference of arbitrarily shaped clusters, etc. The added flexibil- 



ity however comes at a price: there are tuning parameters, such as a kernel bandwidth e, and the 
dimension q of the embedding that need to be chosen and these parameters often interact in a com- 
plicated way. The first step in understanding these tuning parameters is to identify the population 
quantity these methods are actually estimating, then define an appropriate loss function. 

We restrict our discussion to Laplacian-based methods, though the analysis generalizes to other 



spectral kernel methods. Several authors, including |Coif man and Lafon (2006), Belkin and Niyogi 



(2005), [Hein et al. (2005) and |Singer (20 06), and |Gine and Koltchinskii (2006| ) have studied the 



convergence of the empirical graph Laplacian to the Laplace-Beltrami operator of a smooth man- 
ifold as the sample size n — > oo and the kernel bandwidth e — ► 0. In all these studies, the data 
are assumed to lie exactly on a Riemannian submanifold in the ambient space W. Although the 
theoretical framework is appealing, there are several concerns with this approach: (i) distributions 
are rarely supported exactly on a manifold, (ii) even in cases where the manifold assumption is ap- 
proximately reasonable, the bias-variance calculations do not actually take into account stochastic 
variations about a perfect manifold, (iii) the calculations give no information on how the parame- 
ters in the model (such as for example the number of eigenvectors in the embedding) depend on 
the sample size n and the dimension p when noise is present. 

We drop the manifold assumption and instead consider data that are drawn from some general 
underlying distribution. Recently, other work has taken a similar approach. For example 



von 



Luxburg et al. (2008 ) study the consistency of spectral clustering. For a fixed kernel bandwidth e 



and in the limit of the sample size n — > oo, the authors show that the eigenvectors of the graph 
Laplacian converge to the eigenvectors of certain limit operators. In this paper, we allow e to go to 
0. 



4 



The goals of the paper are to: 



1. identify the population quantities being implicitly estimated in Laplacian-based spectral 
methods, 

2. explain how these methods relate to classical kernel smoothing methods, 

3. find the appropriate risk and propose an approach to choosing the tuning parameters. 

We show that spectral methods are closely related to classical kernel smoothing. This link provides 
insight into the problem of parameter estimation in Laplacian eigenmaps and spectral clustering. 
The real power in spectral methods is that they find structure in the data. In particular, they perform 
connectivity learning, with data reduction and manifold learning being special cases. 

Laplacian-based kernel methods essentially use the same smoothing operators as in traditional 
nonparametric statistics but the end goal is not smoothing. These new kernel methods exploit the 
fact that the eigenvalues and eigenvectors of local smoothing operators provide information on the 
underlying geometry of the data. 

In this paper, we describe a version of Laplacian-based spectral methods, called diffusion maps. 
These techniques capture multiscale structure in data by propagating local neighborhood informa- 
tion through a Markov process. Spectral geometry and higher-order connectivity are two new 
concepts in data analysis. In this paper, we show how these ideas can be incorporated into a tradi- 
tional statistical framework, and how this connection extends classical techniques to a whole range 
of new applications. We refer to the resulting method as Spectral Connectivity Analysis (SCA). 

2 Review of Spectral Dimension Reduction Methods 

The goal of dimensionality reduction is to find a function ^ that maps our data X from a space 
X to a new space Z where their description is considered to be simpler. Some of the methods 
naturally lead to an eigen-problem. Below we give some examples. 

2.1 Principal Component Analysis and Multidimensional Scaling 

Principal component mapping is a simple and popular method for data reduction. In principal 
component analysis (PCA), one attempts to fit a globally linear model to the data. If S is a set, 
define 

R(S)=E\\X-n s X\\ 2 (1) 

where nsX is the projection of X onto S. Finding argmin 5gC i?(S'), where C is the set of all q- 
dimensional planes, gives a solution that corresponds to the first q eigenvectors of the covariance 
matrix of X. 

In principal coordinate analysis, the projections ir s x = (z 1: . . . , z q ) on these eigenvectors are 
used as coordinates of the data. This method of reparameterization is also known as classical or 
metric multidimensional scaling (MDS). The goal here is to find a lower-dimensional embedding 
of the data that best preserves pairwise Euclidean distances. Assume that X and Y are covariates 
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in IR P . One way to measure the discrepancy between the original configuration and its embedding 
is to compute 

R(V) = E (d(X,Y) 2 - - *(r)|| 2 ) = J (d(x,y) 2 - \\*(x) - V(y)\\ 2 ) dP{x)dP{y) , 

where d(x,y) 2 = \\x — y\\ 2 . One can show that amongst all linear projections ^ = ir s onto Q- 
dimensional subspaces of M. p , this quantity is minimized when the data are projected onto their first 



q principal components (Mardia et al., 1980). Thus, there is a close connection between principal 
component analysis, which returns the span of a hyperplane, and classical MDS or "principal 
coordinate analysis", which returns the new parameterization ty(x) = (zi, . . . , z q ). 

The duality between PC A and MDS is also directly apparent in empirical computations: Let X 
be an n x p data matrix, where the rows are observations Xi 6 W centered so that - Yli=x x i = 0- 
The solution to PCA is then given by the principal eigenvectors {ve } of the pxp sample covariance 
matrix § = ^X T X. The solution to the MDS problem, on the other hand, is given by the rescaled 
eigenvectors of the n x n Gram or (the positive semi-definite) inner product matrix K = XX T , 
where element j) = (xj, Xj). If {A^, ui} are the principal eigenvalues and eigenvectors of K, 

1 /2 1/2 

then ty(xi) = (A/ ui(i), X 2 ' u 2 (i), ■ ■ .)• 

2.2 Non-Linear Methods 

For complex data, a linear model may not be adequate. There are a large number of non-linear data 
reduction methods; some of these are direct generalizations of the PCA projection method. For 



example, local PCA (Kambhatla and Leen, 1997) partitions the data space into different regions 



and fits a hyperplane to the data in each partition. In principal curves ( |Hastie and StuetzleTT9 89), 
the goal is to minimize a risk of the same form as in Equation [T] but with S representing some 
class of smooth curves or surfaces. 



Among non-linear extensions of PCA and MDS, we also have kernel PCA (Scholkopf et al., 



1998 1 which applies PCA to data $(JT) in a higher (possibly infinite) dimensional "feature space". 
The kernel PCA method never explicitly computes the map $, but instead expresses all calculations 
in terms of inner products k(x, y) = ($(x), Q(y)) where the "kernel" k is a symmetric and positive 

semi-definite function. Common choices include the Gaussian kernel k(x, y) = exp ' — ^— — 



is 



and the polynomial kernel k(x,y) = (x,y) r , where r = 1 corresponds to the linear case in Sec. 2.1 



As shown in Bengio et al. (20041), the low-dimensional embeddings ty(x) used by eigenmap and 



spectral clustering methods are equivalent to the projections (of on the principal axes in 
feature space) computed by the kernel PCA method. 

In this paper, we study diffusion maps, a particular spectral embedding technique. Because of 
the close connection between MDS, kernel PCA and eigenmap techniques, our analysis can be used 
for other methods a well. Below we start by providing some background on spectral dimension 
reduction methods from a more traditional graph-theoretic perspective. In the next section we 
begin our main analysis. 
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2.2.1 Laplacian eigenmaps and other locality-preserving spectral methods 

Most spectral methods take a data-analytic rather than a probabilistic approach to dimension reduc- 
tion. The usual strategy is to construct an adjacency graph on a given data set and then find the op- 
timal clustering or parameterization of the data that minimizes some empirical locality-preserving 
objective function on the graph. 

For a data set with n observations, we define a graph G = (V, E), where the vertex set V = 
{1, . . . , n} denotes the observations, and the edge set E represents connections between pairs of 
observations. Typically, the graph is also associated with a weight matrix K that reflects the "edge 
masses" or strengths of the edge connections. A common choice for data in Euclidean space is to 
start with a Gaussian kernel: Define K.(u, v) = exp ^— W Xu -* v W j f or a \\ d a ta pairs (x u ,x v ) with 

(u, v) E E, and only include cases where the weights K(w, v) are above some threshold 5 in the 
definition of the edge set E . 

Consider now a one-dimensional map / : V — > R that assigns a real value to each vertex; 
we will later generalize to the multidimensional case. Many spectral embedding techniques are 
locality-preserving; e.g. locally linear embedding, Laplacian eigenmaps, Hessian eigenmaps, local 
tangent space alignment, etc. These methods are special cases of kernel PC A, and all aim at 
minimizing distortions of the form 

W) = ^>(/) (2) 

under the constraints that = 1- Typically, Q v (f) is a symmetric positive semi-definite 

quadratic form that measures local variations of / around vertex v, and Qm(I) is a quadratic 
form that acts as a normalization for /. For Laplacian eigenmaps, for example, the neighborhood 
structure of G is described in terms of the graph Laplacian matrix 

L = M - K, 

where M = diag(p 1; . . . , p n ) is a diagonal matrix with p u = J2 v K(u, v) for the "node mass" or 
degree of vertex u. The goal is to find the map / that minimizes the weighted local distortion 

Q(f) = fhf= n^v)(f(u)-f(v)) 2 > 0, (3) 

(u,v)eE 

under the constraints that 

Q M (f) = f t Mf = J2m v f(v) 2 = l 

and (to avoid the trivial solution of a constant function) / T M1 = 0. Minimizing the distortion in 
^ forces f(u) and f(v) to be close if K(w, v ) is large. From standard linear algebra it follows that 
the optimal embedding is given by the eigenvector of the generalized eigenvalue problem 

hf = fjMf (4) 

with the smallest non-zero eigenvalue. 
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We can easily extend the discussion to higher dimensions. Let / 1( . . . , f g be the q first non- 
trivial eigenvectors of normalized so that ffMfj = Sij, where 5ij is Kronecker's delta func- 
tion. The map / : V — > IR 9 , where 

f = (h,...J q ), (5) 

is the Laplacian eigenmap (Belkin and Niyogi, 2003|) of G in q dimensions. It is optimal in the 



sense that it provides the g-dimensional embedding that minimizes 

52 K(u,v)\\f(u)-f(v)\\ 2 = j2f^ (6) 
(u,v)e.E i=l 

in the subspace orthogonal to Ml, under the constraints that fJ'M.fj = 5^ for i, j = 1, . . . , q. 

If the data points x u lie on a Riemannian manifold A4, and / : M. — > K is a twice differen- 
tiable function, then the expression in Eq.|3]is the discrete analogue on graphs of J M ||Vx/|| 2 = 
— J M (A M f)f, where Vm and Am, respectively, are the gradient and Laplace-Beltrami opera- 
tors on the manifold. The solution of argmin||/|| = i J M ||V^/|| 2 is given by the eigenvectors of 
the Laplace-Beltrami operator A M . To give a theoretical justification for Laplacian-based spectral 
methods, several authors have derived results for the convergence of the graph Laplacian of a point 



cloud to the Laplace-Beltrami operator under the manifold assumption; see Belkin and Niyogi 



(2005| ); [Coifman and Lafon (2006) ; [Singer (2006| ); |Gine and Koltchinskii (2006) 



2.2.2 Laplacian-based methods with an explicit metric 

Diffusion mapping is an MDS technique that belongs to the family of Laplacian-based spectral 
methods. The original scheme was introduced in the thesis work by Lafon (2004) and in |Coif man 



et al. (2005a|b ). See also independent work by Fouss et al. (2005[ ) for a similar technique called 



Euclidean commute time (ECT) maps. In this paper, we will describe a slightly modified version 



of diffusion maps that appeared in (Coifman and Lafon, 2006 , Lafon and Lee, 2006 



The starting point of the diffusion framework is to introduce a distance metric that reflects the 
higher-order connectivity of the data. This is effectively done by defining a diffusion process or 
random walk on the data. 

As before, we here describe a graph approach where the nodes of the graph represent the 
observations in the data set. Assuming non-negative weights K and a degree matrix M, we define 
a row-stochastic matrix A = M _1 K. We then imagine a random walk on the graph G = (V, E) 
where A is the transition matrix, and element A(u, v) corresponds to the probability of reaching 
node v from u in one step. Now if A m is the m th matrix power of A, then element A m (u,v) 
can be interpreted as the probability of transition from u to v in m steps. By increasing m, we 
are running the Markov chain forward in time, thereby describing larger scale structures in the 
data set. Under certain conditions on K, the Markov chain has a unique stationary distribution 

S (v) = p{v)/Y, u &V P( U )- 



'See http://www.stat. emu. edu/^annlee/software. htm for example code in Matlab and R. 
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We define the diffusion distance between nodes u and v as a weighted L 2 distance between the 
two distributions A m (u, •) and A m (v, ■), 

ifcev v ' 

This quantity captures the higher-order connectivity of the data at a scale m and is very robust 
to noise since it integrates multiple- step, multiple-path connections between points. The distance 
D m (u, v) 2 is small when A m (u, v) is large, or when there are many paths between nodes u and v 
in the graph. 

As in multidimensional scaling, the ultimate goal is to find an embedding of the data where 
Euclidean distances reflect similarities between points. In classical MDS, one attempts to preserve 
the original Euclidean distances d 2 (u,v) = \\x u — x v \\ 2 between points. In diffusion maps, the 
goal is to approximate diffusion distances D^u, v). One can show (see appendix) that the optimal 
embedding in q dimensions is given by a "diffusion map" ty m , where the coordinates of the data 
are the (rescaled) right eigenvectors of the Markov matrix A. In fact, assuming the kernel matrix 
K is positive semi-definite, we have that 

v e V h- * m (v) = (ATViOO, WMv), ■ ■ ■ , KM")) e ^ , 

where {ipi}e>o are the principal eigenvectors of A and the eigenvalues Ao = 1 > X% > ■ ■ ■ 0. 
This solution is, up to a rescaling, the same as the solution of Laplacian eigenmaps and spectral 
clustering, since 

hip = fjMip <^> Aip = Xip . 

for A = 1 — fi and L = M — K. The diffusion framework provides a link between Laplacian- 
based spectral methods, MDS and kernel PCA, and can also be generalized to multiscale geome- 



tries (Coifman et al., 2005b, Coifman and Maggioni, 2006) 



Remark 1 The link to MDS and kernel PCA is even more explicit in the original version of diffu- 



sion maps ( Coifman et al., 2005a ), which is based on the symmetric (positive semi-definite) kernel 
matrix A m =^M 1 / 2 A m M~ 1 / 2 = I — M -1 / 2 LM -1 / 2 , and the metric D 2 m (u,v) = A m (u,u) + 
A m (v,v) — 2A m (u,v) induced by this kernel. In classical MDS and linear PCA, the analogue 
is a positive semi-definite kernel matrix K, where K(w, v) = (x u , x v ), and Euclidean distances 
d 2 (u, v) = \\x u — a^H 2 = K(u, u) + K(v, v) — 2K(u, v). In both cases, the data are parameterized 
by the rescaled principal eigenvectors (A* 0i, X^ 2 ^, ■ ■ ■) of the kernel matrix associated with the 
metric. 



3 Diffusion Maps 

The diffusion map creates a distribution- sensitive reparameterization. We will study the method 
under the assumption that the data are drawn from an underlying distribution. We begin by intro- 
ducing a Markov chain that plays an important role in the definition of the diffusion map. 
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3.1 A Discrete-Time Markov Chain 

Definitions. Suppose that the data X\,... ,X n are drawn from some underlying distribution P 
with compact support X C M. d . We assume P has a density p with respect to Lebesgue measure \i. 
Let 

1 / \x — y\ 2 



denote the Gaussian kerne^] with bandwidth h = y/2s. We write the bandwidth in terms of e 
instead of h because e is more natural for our purposes. Consider the Markov chain with transition 
kernel Vt e {x, •) defined by 

«,<«, A) = P(x A) = Y t f' V) Z!1 = LUX f dny) (8) 

J k £ (x } y)dP{y) p £ (x) 

where p £ (x) = J k £ (x,y)dP(y). 

Starting at x, this chain moves to points y close to x, giving preference to points with high 
density p(y). In a sense, this chain measures the connectivity of the sample space relative to p. The 
stationary distribution S £ is given by 

, x f/i Pe(x)dP(x) 

S £ (A) - JA K ' K J 



and 



/ p e (x)dP(x) 

I A P( x ) dP ( x ) n 
eV ; / p{x)dP{x) 



Define the densities 



, s dtt £ k £ (x,y)p(y) 
u e (x,y) = —{x,y) = — 

(Z/i Pe{X) 

dfl £ k £ (x,y) 
ae{x,y) = — 5 -{x,y) = -— . 

dP Pe{X) 

The diffusion operator A £ — which maps a function / to a new function A £ f — is defined by 

A f! x f ( \ -p / wd/ \ I k e(x,y)f(y)dP(y) 

A £ f(x) = j a £ (x, y)f(y)dP(y) = j k ^ y)dp{y) • (9) 

We normalize the eigenfunctions {ip e ,o, ipe,u • • •} of A e by 

i^l l {x)s £ {x)dP{x) = 1, 



2 Other kernels can be used. For simplicity, we will focus on the Gaussian kernel which is also the Green's function 
of the heat equation in R d . 
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where 

= p e (x) 
Se[X) Jp £ (y)dP(y) 

is the density of the stationary distribution with respect to P. The first eigenfunction of the operator 
A e is ij) e fl{x) = 1 with eigenvalue A £j0 = 1. In general, the eigenfunctions have the following 
interpretation: ip s j is the smoothest function relative to p, subject to being orthogonal to ip £yi , 
i < j. The eigenfunctions form an efficient basis for expressing smoothness, relative to p. If a 
distribution has a few well defined clusters then the first few eigenfunctions tend to behave like 
indicator functions (or combinations of indicator functions) for those clusters. The rest of the 
eigenfunctions provide smooth basis functions within each cluster. These smooth functions are 
Fourier- like. Indeed, the uniform distribution on the circle yields the usual Fourier basis. Figure [3] 
shows a density which is a mixture of two Gaussians. Also shown are the eigenvalues and the first 
4 eigenfunctions which illustrate these features. 

Denote the m-step transition measure by Q £tm (x, •). Let A £)Tn be the corresponding diffusion 
operator which can be written as 



y)f(y)dP(y) 



where a e , m {x,y) = dQ £tin /dP. 

Define the empirical operator A £ by 



AJ(x) = S^M^I^ = / uM.iDfUlXlPAH) (10) 



y^-i k £ (x, X,, 

where P n denotes the empirical distribution, a £ (x, y) = k e (x, y)/p e (x) and 



Ps{X) 



r - i n 

/ k £ {x,y)dP n {y) = -Y j k £ {x,X l ) (11) 
J nj^ 



is the kernel density estimator. Let A £>m be the corresponding m-step operator. Let i\) £ ^ denote 
the eigenvectors of the matrix A e where A £ (j, k) = k £ (Xj, X k ) /p £ (Xj). These eigenvectors are 
estimates of ip£ at the observed values X 1; . . . , X n . The function ipi(x) can be estimated at values 
of x not corresponding to one of the X/s by kernel smoothing as follows. The eigenfunction- 
eigenvalue equation \ E tip e l = A £ ijj £j e can be rearranged as 



suggesting the estimate 



, / x A e ip e ,t I k £ {x,y)^eAy) dp (y) nn , 

^PeAx) = — = — y— — — (12) 

^e,e X £ .eJk £ (x,y)dP(y) 



fax) = T.Mx.x^Xi) (13) 



which is known in the applied mathematics literature as the Nystrom approximation. 
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Interpretation. The diffusion operators are averaging operators. Equation ([9]) arises in non- 
parametric regression. If we are given regression data = f{Xi) + 6j, i = 1, . . . , n, then the 
kernel regression estimator of / is 



Replacing the sample averages in (14) with their population averages yields Q. One may then 
wonder: in what way spectral smoothing is different from traditional nonparametric smoothing? 
There are at least three differences: 



1. Estimating A £ is an unsupervised problem, that is, there are no responses YJ. (But see Section 
[7] for applications to supervised problems.) 

2. In spectral methods, smoothing is not the end goal. The main objective is finding structure 
in the data. The eigenvalues and eigenvectors of A £ provide information on the intrinsic 
geometry of the data and can be used to parameterize the data. 

3. In spectral smoothing, we are interested in A £)Tn for m > 1. The value m = 1 leads to a local 
analysis of the nearest-neighbor structure — this part is equivalent to classical smoothing. 
Powers m > 1, however, takes higher-order structure into account. 

The concept of connectivity is new in nonparametric statistics and is perhaps best explained in 
terms of stochastic processes. Introduce the forward Markov operator 

M e g{x)= [ a £ (y,x)g(y)dP(y) (15) 
J x 

and its m-step version M £jm . The first eigenfunction of M £ is ip E ,o(x) = s £ (x), the density of the 
stationary distribution. In general, 

<p £>e = s e (x)if) ert (x). 

The averaging operator A and the Markov operator M and (and hence also the iterates A £>m and 
M £>m ) are adjoint under the inner product (/, g) = f x f(x)g(x)dP(x), i.e. (A £ f, g) = (f, M £ g). 
By comparing ([7]) and the heat kernel of a continuous-time diffusion process (see equation (3.28) 
in |Grigor'yan (2006| )), we identify the time step of the discrete system as t = me. 

The Markov operator M £ = A* £ maps measures into measures. That is, let L l P (X) = {g : 
g(y) > 0,J g(y)dP(y) = 1}. Then g e Lp(X) implies that M £)Tn g e L l p {X). In particular, if 
cp is the probability density at time t — 0, then M £ , m Lp is the probability density after m steps. 
The averaging operator A £ maps observables into observables. Its action is to compute conditional 
expectations. If / E Lp°(X) is the test function (observable) at t = 0, then A £ ^ m f e Lp(X) is the 
average of the function after m steps, i.e. at a time comparable to t = me for a continuous time 
system. 
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3.2 Continuous Time 



Under appropriate regularity conditions, the eigenfunctions {ip £ /} converge to a set of functions 
{ifre} as e — > 0. These limiting eigenfunctions correspond to some operator. In this section we 
identify this operator. The key is to consider the Markov chain with infinitesimal transitions. 
In physics, local infinitesimal transitions of a system lead to global macroscopic descriptions by 
integration. Here we use the same tools (infinitesimal operators, generators, exponential maps, etc) 
to extend short-time transitions to larger times. 
Define the operator 

GJ{x) = \[J x <h{x, y)f(y)dP{y) - f{xfj . (16) 

Assume that the limit 

Gf = lim G e f = lim Aef ~ f (17) 

e^O e^O e 

exists for all functions / in some appropriately defined space of functions T. The operator G is 
known as the infinitesimal generator. A Taylor expansion shows that 

G = -A + ^ (18) 

V 

for smooth functions where A is the Laplacian and V is the gradient. Indeed, G £ f = — A/ + ^ + 
0(e) which is precisely the bias for kernel regression. 

Remark 2 In kernel regression smoothing, the term Vp/p is considered an undesirable extra bias, 
called design bias {[Fan (1993\ ). In regression it is removed by using local linear smoothing which 
is asymptotically equivalent to replacing the Gaussian kernel k £ with a bias-reducing kernel k*. In 
this case, G = —A. 

For i > define 

vl t = — and i/p = lim v% p. (19) 

The eigenvalues and eigenvectors of G £ are —u^ e and t/> £i £ while the eigenvalues and eigenvectors 
of the generator G are —v\ and ipg. Also, i\) e ^ ~ ip£. 



Let A t = lim £ ^ A Bit / e . From ( 16) and ( 17), it follows that 



A t = lim A £>t/£ = lim(7 + eG £ ) t/e = lim(J + eG) t/e = e Gt . (20) 



e^O ' e->0 " e^O 



The family {A t } t > defines a continuous semigroup of operators (Lasota and Mackey, 1994). The 
notation is summarized in Table [T] 

One of our goals is to find the bandwidth e so that A £:t / £ is a good estimate of A t . We show 
that this is a well-defined problem. Related work on manifold learning, on the other hand, only 
discusses the convergence properties of the graph Laplacian to the Laplace-Beltrami operator, i.e. 
the generators of the diffusion. Estimating the generator G, however, does not answer questions 
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Table 1: Summary of notation 



Operator Eigenfunctions Eigenvalues 

Ae-i „/,. ,,2 _ _ 



G = lim £ ^o ^ ^ = lim. 



A* = e tG = ZT=o i>t e-^I = lim^o A$ 
= lim £ ^o Ag^g 



regarding the optimal choice of the number of eigenvectors, the number of groups in spectral 
clustering etc. 

We can express the diffusion in terms of its eigenfunctions. Mercer's theorem gives the 
biorthogonal decomposition 



(21) 
(22) 



where ip £! t are the eigenvectors of A £ , and cp Bj i are the eigenvectors of its adjoint M £ . The details are 

I, it follows that the eigenvalues \ £ ^ = 1 — ev £t The averaging 
lave the same eigenvectors. Inserting (21 ) into ([9]) and recalling 



given in Appendix 8. 1 From (16 



operator A £ and its generator G 

that <p e Ax) = s £ (x)ip £ Ax), gives 



A £ f(x) = ^2\ £ ^eA x ) \ 



<p £ ,e(y)f(y)dP(y) 

^2h,ei>eA x ) / AAy)f(y) s e(y)dP(y) 

^2 X e,e^eA X )(^e,e, f)e = ^ A ^ U e,ef( 



i>0 



£>0 



where (f,g) £ = J x f(y)g(y)s £ (y)dP(y) and Yi £ ^ is the weighted orthogonal projector on the 
eigenspace spanned by ip £ ^. Thus, 



(23) 



£>0 



Similarly, assuming the limit in (20) exists, 



(24) 



£>0 
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where 11^ is the weighted orthogonal projector on the eigenspace corresponding to the eigenfunc- 
tion ip£ of G. Weyl's theorem ( |Stewart (1991| )) gives 



sup \e-^ - tfJH < \\A Ejt/e - e Gt \\ = te + 0(e 2 ), (25) 



lim A* 7 ! = e~^\ lim Ii e/ = U e 

e^O ' e— >0 



(26) 



Note that {ip^} is an orthonormal basis with respect to the inner product 

(f,g)e= / f(x)g(x)s E (x)dP(x) 



while {ifi} is an orthonormal basis with respect to the inner product 

(f,9h /£ = J -^-dPix). 



Equation (24) implies that to estimate the action of the limiting operator A t at a given time 
t > 0, we need the dominant eigenvalues and eigenvectors of the generator G. Finally, we also 
define the limiting transition density 

& t (x,y) = \ima £it/£ (x,y). (27) 

As t — > 0, a. t (x, y) converges to a point mass at x; as t — > oo, a t (x, y) converges to p(y). 

Remark 3 There is an important difference between estimating A t and G: the diffusion operator 
A t is a compact operator, while the generator G is not even a bounded operator. Consider, for 
example, the Laplacian on a circle S 1 (Rosenberg, 1997). The eigenfunctions ofG and A t are here 
the Fourier basis functions e t£x where £ = 0, ±1, ±2, . . .. The heat operator e~ tA is a compact 
operator. Its eigenvalues are e~ £ t (t > 0) which are clearly bounded above and go to zero. The 
Laplace-Beltrami operator A, on the other hand, has eigenvalues n 2 which are unbounded. 

We will consider some examples in Section [6] but let us first illustrate the definitions for a 
one-dimensional distribution with multiscale structure. 

Example 1 Suppose that P is a mixture of three Gaussians. Figure^shows the density p. The left 
column of Figure \5\shows Q t for increasing t. The right column shows a fixed row ofVt t , namely 
u>t{x, •) for a fixed x indicated by the horizontal line. The density u> t {x, ■) starts out concentrated 
near x. As t increases, it begins to spread out. It becomes bimodal at t = 1 indicating that the two 
closer clusters have merged. Eventually, the density has three modes (indicating a single cluster) 
att = 10, and then resembles p when t = 1000 since u> t (x, ■) — > p 2 {-)/ f p 2 {u)du as t — > oo. 
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3.3 Comparing e and t 



The parameters t and e are both related to smoothing but they are quite different. The parameter t 
is part of the population quantity being estimated and controls the scale of the analysis. Hence, the 
choice of t is often determined by the problem at hand. The parameter e is a smoothing parameter 
for estimating the population quantity from data. As n — > oo, we let e n — > for more accurate 
estimates. The following two examples illustrate the differences of smoothing in data when using 
e or t. 

Example 2 Consider a fixed test function f . Define 

A = \ g = AJ : < e < oo i and A* = { g = A t f : < t < oo 



Let 

where 5 C denotes a point mass distribution at c. If f is any continuous functions then both A e and 
A t depend only on the two values /(0) and /(l) which we will assume are distinct. 
Now 

AJ{x) = k e (x,0)f(0) + k e (x,l)f(l) 
k £ (x,0) + k £ (x, 1) 

In particular 

Anf(x)~l f{0) X<1/2 

and Aoofi^x) = c for all x where c = (/(0) + /(l))/2. For < e < oo, A e f(x) is a smooth 
monotone function; see Figure^ 
In contrast, 

' f(0) x < 1/2 
/(I) x>l/2 

for all values oft. In other words, A t f(x) = A f(x) for all t. The reason is that A t has two 
eigenf unctions: ipo{x) = 1 and tpi(x) = I(x > 1/2) — I(x < 1/2) (assuming the normalization 
J ip 2 (x)dP(x) = 1.) The eigenvalues are Ao = Ai = 1. Hence, u = v\ = and so 

A t = n + iii 

where n projects onto ipo and U\ projects onto ipi- The step function behavior of A t reflects the 
lack of connectivity of P. 



A*/(x) 



Example 3 Assume that the distribution is supported along two parallel lines of length ir at v = 
and v — 1, respectively. The probability measure is 

p = \ u «+\ Ul 
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Figure 6: A e f for increasing values of e. The function A t f corresponds to the top left plot and does not change as 
t changes. 

where Uq is uniform on {(0,x) : < x < 7r} and U\ is uniform on {(l,x) : < x < 7r}. 
Consider a fixed test function f. We have that 



parameter e. When e is small, A e f(x) will only depend on the values off close to the origin along 
the line at v = 0. However, with increasing e, smoothing will also involve function values further 
from the origin, including values along the parallel line at v = 1, as indicated by the red dashed 
curves in the figure. 

In contrast, for x = (0, 0), A t f(x) only depends on values of f in the same connected set as 
x, i.e. function values along the line at v — 0, regardless oft. Figure^illustrates how the weights 
£t(x, y) change as the parameter t increases. Smoothing by t reflects the connectivity of the data. 
In particular, there is no mixing of values of f from disconnected sets. 




where the weights £ £ (x, y) 
Let x = (0, 0) and y 
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Figure 7: 40, y) forx = (0,0), y = (u,v) ande = 0.01,0.1,1,10. 

4 Diffusion Distance 

The diffusion distance is another quantity that captures the underlying geometry. 
4.1 Definition 

For an m-step Markov chain, the diffusion distances are defined by 

for m = 1, 2, It can be shown (see appendix) that 

D 2 e , m (x, z) = J2 X T^M*) ~ <M^)) 2 - (28) 

Following the same arguments as before we deduce that the corresponding population quantity is 

Now we compare diffusion distance to two other distances that have been used recently: geodesic 
distance and density distance. 
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Figure 8: t t {x,y) forx = (0,0), y = (u,v) andi = 0.01,0.1,1,10. 

4.2 Geodesic Distance 

The geodesic distance, or the shortest path, is a very intuitive way of measuring the distance be- 



tween two points in a set. Some manifold learning algorithms, such as Isomap (Tenenbaum et al., 
2000|), rely on being able to estimate the geodesic distance on a manifold given data in W. The 
idea is to construct a graph G on pairs of points at a distance less than a given threshold S, and 
define a graph distance 



dc{A, B) = min (||x — xi\\ + . . . + \\x, 



m—l 



Xm\\ ) 



where n = (x , . . . , x m ) varies over all paths along the edges of G connecting the points A = x 
and B = x rn . Multidimensional scaling is then used to find a low-dimensional embedding of the 
data that best preserves these distances. 



Under the assumption that the data lie exactly on a smooth manifold M., Bernstein et al. (2000) 



have shown that the graph distance dc(A, B) converges to the geodesic manifold metric 

d M (A,B) = mf{length( 7 )}, 

where 7 varies over the set of smooth arcs connecting A and B in Ai. Beyond this ideal situation, 
little is known about the statistical properties of the graph distance. Here we show by two examples 
that the geodesic (graph) distance is inconsistent if the support of the distribution is not exactly on 
a manifold. 
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Consider a one-dimensional spiral in a plane: 



x = t a cos(fot) 
y = t a sin(fet) 

where a = 0.8 and b = 10. The geodesic manifold distance djVt(A B) between two reference 
points A and B with t = n/2b and t = /2b, respectively, is 3.46. The corresponding Euclidean 
distance is 0.60. 

Example 4 (Sensitivity to noise) We first generate 1000 instances of the spiral without noise, 
(that is, the data fall exactly on the spiral) and then 1000 instances of the spiral with exponential 
noise with mean parameter f3 = 0.09 added to both x and y. For each realization of the spiral, we 
construct a graph by connecting all pairs of points at a distance less than a threshold r. Figure^ 
shows histograms of the relative change in the geodesic graph distance (top) and the diffusion 
distance (bottom) when the data are perturbed. (The value corresponds to no change from the 
average distance in the noiseless cases). For the geodesic distance, we have a bimodal distribution 
with a large variance. The mode near —0.15 corresponds to cases where the shortest path between 



A and B approximately follows the branch of the spiral; see Figure 10 (left) for an example. The 
second mode around —0.75 occurs because some realizations of the noise give rise to shortcuts, 
which can dramatically reduce the length of the shortest path; see Figure\lO^( right) for an example. 
The diffusion distance, on the other hand, is not sensitive to small random perturbations of the 
data, because unlike the geodesic distance, it represents an average quantity. Shortcuts due to 
noise have little weight in the computation, as the number of such paths is much smaller than the 
number of paths following the shape of the spiral. This is also what our experiment confirms: 
Figure^(bottom) shows a unimodal distribution with about half the variance as for the geodesic 
distance. In our experiment, the sample size n = 800 and the neighborhood size r = 0.15. To be 
able to directly compare the two methods and use the same parameters, we have for the diffusion 
distance calculation digressed from the Gaussian kernel and instead defined an adjacency matrix 
with only zeros or ones, corresponding to the absence or presence of an edge, respectively, in the 
graph construction. 



Example 5 (Consistency) For a distribution not supported exactly on a manifold, the problem 
with shortcuts gets worse as the sample size increases. This is illustrated by our next experiment 
where the noise level (3 = 0.09 and the neighborhood size r = 0.1 are fixed, and the sample size 
n = 600, 2000 and 4000. Figure\lljshows that for a small enough sample size, the graph estimates 
are close to the theoretical value d_M = 3.46. For intermediate sample sizes, we have a range of 
estimates between the Euclidean distance \\xa — %b\\ = 0.6 and d_M- As n increases, shortcuts 
are more likely to occur, with the graph distance eventually converging to the Euclidean distance 
in the ambient space. 
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Geodesic distance 




4.3 Density Sensitive Metrics 

In certain machine learning methods, such as semisupervised learning, it is useful to define a den- 
sity sensitive distance for which x and y are close just when there is a high density path connecting 



x and y. This is precisely what diffusion distances do. Another such metric is (Bousquet et al., 
20031 

f y ds 

* ,s)=i « f X mar) 

where the infimum is over all smooth paths (parameterized by path length) connecting x and y. 
The two metrics have similar goals but D t (:r, y) is more robust and easier to estimate. Indeed, to 
find d(x, y) one has to examine all paths connecting x and y. 



23 



Figure 10: Two realizations of a noisy spiral. The solid line represents the shortest path between two reference 
points A and B in a graph constructed on the data. 
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Figure 1 1: Inconsistency of the geodesic graph distance. Distribution of the geodesic distance for (top to bottom) 
sample sizes n = 600, 2000 and 4000. The dashed vertical lines indicate the Euclidean distance in the ambient space 
and the geodesic distance of the ideal manifold. 
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5 Estimation 



Now we study the properties of A e it / e as an estimator of A t . Let Il e e be the orthogonal projector 
onto the subspace spanned by ip e j and let IL; be the projector onto the subspace spanned by 
Consider the following operators: 

A t (e,P) = A £)t/£ = ££o A S n M> A t (e,q,P) = 
A t {e,q,P n ) = ELoX^i, A t 

where ipe,i and \ et g denote the eigenfunctions and eigenvalues of A s , and ip et £ and A e ^ are the 
eigenf unctions and eigenvalues of the data-based operator A £ . Two estimators of A t are the trun- 
cated estimator A t (s, q, P n ) and the non-truncated estimator A t (s, P n ) = e*^ -1 " 6 . In practice, 
truncation is important since it corresponds to choosing a dimension for the reparameterized data. 



5.1 Estimating the Diffusion Operator A t 

Given data with a sample size n, we estimate A t using a finite number q of eigenfunctions and a 
kernel bandwidth e > 0. We define the loss function as 

L n (e,q,t) = \\A t -A t {e,q,P n )\\ (30) 



where = supj gjr H-B/H2/H/II2 and ||/|| 2 = J f x f 2 (x)dP(x) where T is the set of uniformly 
bounded, three times differentiable functions with uniformly bounded derivatives whose gradients 
vanish at the boundary. By decomposing L n into a bias-like and variance-like term (Figure 12), 
we derive the following result for the estimate based on truncation. Define 

00 

p(t) = Y^e-^. (31) 

e=i 

Theorem 1 Suppose that P has compact support, and has bounded density p such that mf x p(x) > 
and sup x p(x) < 00. Let e n — > and net/ 2 / log(l/e n ) — > 00. Then 



00 



L n (e n ,q,t)=p(t) ( P ( \\ l0g 2ilt^ ) +0(e n ) ) + 0(1) (32) 



USr. 



9+1 



77ze optimal choice of e n is e n x (log n/n) 2 ^ d+8 ' ) in which case 

2/(d+8) x 



^ +0(l)^e-^. (33) 



g+i 

We also have the following result which does not use truncation. 
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Theorem 2 Define 

A t (e,P n ) = e t ^' 1 ^ 

Then, 



\A t -A t (e n ,P n )\\ =\0 P \ \ hj^i ) +0(e n ) | x p(t) (34) 
The optimal e n is e n x (logn/n) 2 ^ d+8 \ With this choice, 

||A t - A t (e, P n ) || = Op l-j- j xp(t). 
Let us now make some remarks on the interpretation of these reults. 

1 . The terms 0{e n ) and Y^+i e ~ u ^ correspond to bias. The term Op [^J log (d{t^ j corresponds 
to the square root of the variance. 

2. The rate n~ 2 K d+ ^ is slow. Indeed, the variance term l/(ne^ +4 ^ 2 ) is the usual rate for 
estimating the second derivative of a regression function which is a notoriously difficult 
problem. This suggests that estimating A t accurately is quite difficult. 

3. We also have that 



ne 



n 



and the first term is slower than the rate nen +2 ^ 2 in Gine and Koltchinskii (2006) and 
Singer (2006). This is because they assume a uniform distribution. The slower rate comes 
from the term p £ (x) — p £ (x) which cannot be ignored when p is unknown. 

4. If the distribution is supported on a manifold of dimension r < d then e^+^Z 2 becomes 
£ (r+4)/2 j n b e t W een full support and manifold support, one can create distributions that are 
concentrated near manifolds. That is, one first draws Xj from a distribution supported on a 
lower dimensional manifold, then adds noise to This corresponds to full support again 
unless one lets the variance of the noise decrease with n. In that case, the exponent of e can 
be between r and d. 



5. Combining the above results with the result from Zwald and Blanchard (2006) , we have that 



^-vw!i = I op I \MjM) + °( e »n h — —y 

1 1 » ne ( n }/ J J mmo<j<t(Vj - vf-i) 

6. The function p{t) is decreasing in t. Hence for large t, the rate of convergence can be 
arbitrarily fast, even for large d. In particular, for t > p _1 (n _ ( d+4 )^ 2 ^ +8 ^) the loss has the 
parametric rate Op(y/\ogn/n). 
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A t (s,q,P n ) 

Figure 12: Decomposition of the loss into bias and variance. 

7. An estimate of the diffusion distance is 

oo 
1=0 

The approximation properties are similar to those of A t . 

8. The dimension reduction parameter q should be chosen as small as possible while keeping 



the last term in (32 ) no bigger than the first term. This is illustrated below. 



Example 6 Suppose that vg = $ for some j3 > 1/2. Then 

L n (e n ,q,t) = ^m 0p {^Ji^s)) +C2e ■ 



The smallest q n such that the last term in (32 ) does not dominate is 



^logt + ^logn 



1/(2/3) 



and we get 



L n (e n ,q,t) - P (^T/wj n 2/fd+8)^J ■ 



5.2 Nodal Domains and Low Noise 

An eigenfunction ipg partitions the sample space into regions where ip£ has constant sign. This 
partition is called the nodal domain of ip£. In some sense, the nodal domain represents the basic 
structural information in the eigenfuction. In many applications, such as spectral clustering, it is 
sufficient to estimate the nodal domain rather than ip e . We will show that fast rates are sometimes 
available for estimating the nodal domain even when the eigenfunctions are hard to estimate. This 
explains why spectral methods can be very successful despite the slow rates of convergence that 
we and others have obtained. 
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Formally, the nodal domain of if)g is JVg = {Ci, . . . , Ck} where the sets Ci, . . . , C& partition 
the sample space and the sign of ip£ is constant over each partition element Cj. Thus, estimating 
the nodal domain corresponds to estimating Hg{x) = sign(ipe(x)). ^ 

Recently, in the literature on classification, there has been a surge of research on the so-called 
"low noise" case. If the data have a low probability of being close to the decision boundary, then 
very fast rates of convergence are possible even in high dimensions. This theory explains the 
success of classification techniques in high dimensional problems. In this section we show that a 
similar phenomema applies to spectral smoothing when estimating the nodal domain. 

Inspired by the definition of low noise in |Mammen and Tsybakov (1999] ), |Audibert and Tsy 



bakov (2007] ), and Kohler and Krzyzak (2007]), we say that P has noise exponent a if there exists 



C > such that 

P(0 < \M X )\ < 5) < CS a . (35) 

We are focusing here on ipi although extensions to other eigenfunctions are immediate. Figure 



13 shows 4 distributions. Each is a mixture of two Gaussians. The first column of plots shows 
the densities of these 4 distributions. The second column shows ipi- The third column shows 
< \ipi(X) | < 5). Generally, as clusters become well separated, ipi behaves like a step function 



and P(0 < l^ipT)! < 5) puts less and less mass near which corresponds to a being large. 



Theorem 3 Let H{x) = sign(/0i(;r)) and H(x) = signal (#)). Suppose that (35) holds. Set 

£n = n -2/(4a+d+8)_ Then> 

F(H(X) ^ H(X)) < n'^rhd (36) 

where X ~ P. 

Note that, as a — > oo the rate tends to the parametric rate n~ l l 2 . 

5.3 Choosing a Bandwidth 

The theory we have developed gives insight into the behavior of the methods. But we are still left 
with the need for a practical method for choosing e. Given the similarity with kernel smoothing, it 
is natural to use methods from density estimation to choose e. In density estimation it is common 
to use the loss function J (p(x) — p £ (x)) 2 dx which is equivalent, up to a constant, to 

C(e) = J p 2 £ {x)dx — 2 J p e {x)p{x)dx. 
A common method to approximate this loss is the cross-validation score 



£( £ ) = / pi( x ) dx — ^2psA x 

J 71 8=1 



3 If ip is an eigenfunction then so is —ip. We implicitly assume that the sign ambiguity of the eigenfunction has 
been removed. 
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Figure 13: Each row corresponds to a mixture of two Gaussians. The first column of plots shows the densities of 
these distributions. The second column shows ipi. The third column shows P(0 < |^i(X)| < 5) as a function of S. 
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where p £:i is the same as p £ except that X { is omitted. It is well-known that C(e) is a nearly 
unbiased estimate of C(e). One then chooses e n to minimize C{e). 

The optimal e* n from our earlier result is (up to log factors) 0(n~ 2 ^ d+8 ^) but the optimal band- 
width e° n for minimizing C is 0(rT 2 ^ d+ ^). Hence, e*Je° n x n 8/((d+4)(d+8))_ This suggests t h a t 
density cross-validation is not appropriate for our purposes. 

Indeed, there appears to be no unbiased risk estimator for this problem. In fact, estimating 
the risk is difficult in most problems that are not prediction problems. As usual in nonparametric 
inference, the problem is that estimating bias is harder than the original estimation problem. In- 
stead, we take a more modest view of simply trying to find the smallest e such that the resulting 
variability is tolerable. In other words, we choose the smallest e that leads to stable estimates of the 
eigenstructure (similar to the approach for choosing the number of clusters in Lange et al. (2004)). 
There are several ways to do this as we now explain. 



Eigen-Stability. Define ip et (x) = E( , e ^(x)). Although ip ei ^ ijj£, they do have a similar 
shape. We propose to choose e by finding the smallest e for which ip e e can be estimated with a 
tolerable variance. To this end we define 



SNR(e) 



\\1> 



el\\2 



(37) 



el\\2 



which we will refer to as the signal-to-noise ratio. When e is small, the denominator will dominate 
and SNR(e) m 0. Conversely, when e is large, the denominator tends to so that SNR(e) gets 
very large. We want to find Eq such that 



e = infle : SNR( £ ) > K % 



for some K„ > 1. 

We can estimate SNR as follows. We compute B bootstrap replications 

2W 2(B) 

Ye,£ > • • ■ ' Ve,l ■ 



We then take 



where c + = max{c, 0}, 



SNR(e) 



\ 



e,£\\2 



2 _ £ 2 



(38) 



ell 12 



6=1 



and ^£ = B 1 J2f=i ^ie- ^ ote mat we subtract ^ 2 from the numerator to make the numerator 
approximately an unbiased estimator of | \ip e t \ | 2 . Then we use 



e = mm{e : SNR(e) > K n }. 
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We illustrate the method in Section [6] For K n = Cn 2 ^ d+S \ where C is a constant, the optimal e is 

Q( n -2/(d+8))_ Tq write 

i,e( x ) = ^?X X ) + K x ) + £0*0 
where b(x) denotes the bias and £(x) = ip £t e(x) — ipi(x) — b(x) is the random component. Then 

\\M x ) + K x )\\ 2 o(i) 



SNR fel 



m\\ 2 "op(^ y7 ,)- 



Setting this equal to K% yields e Q = 0(n~ 2/(d+8) ). 

The same bootstrap idea can be applied to estimating the nodal domain. In this case we define 



SNR(e) 



H II 2 - P'< 



(39) 



where 



1 B 

£ 2 = -B22\\ H e4 ~ H e,e\\2 



B 

6=1 



Neighborhood Size Stability. Another way to control the variability is to ensure that the 
number of points involved in the local averages does not get too small. For a given e let iV = 
{N%, . . . ,N n } where Ni = #{Xj : \\Xi — Xj\\ < V2e}. One can informally examine the 
histogram of iV for various e. A rule for selecting e is 

e~= min{e : medianjiVi, . . . , A^„} > k}. 

We illustrate the method in Section 

An alternative, suggested by | von Luxburg (2007] ), is to choose the smallest e that makes the re- 



sulting graph well-connected. This leads to e = O ((log n/n) 1 ^). More specifically, von Luxburg 



(2007 ) suggests to "... choose e as the length of the longest edge in a minimal spanning tree of the 



fully connected graph on the data points." 



6 Examples 

6.1 Two Gaussians 

Let 

p(a;) = ^(x;-2,l) + ^(x;2,l) 
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where <f>(x; //, a) denotes a Normal density with mean /j and variance a 2 . Figure 14 shows the 
error — ij) e i\\ as a function of e for a sample of size n = 1000. The results are averaged over 
approximately |j 200 independent draws. A minimal error occurs for a range of different values of 
e between 0.03 and 0.1. The variance dominates the error in the small e region (e < 0.03) , while 



the bias dominates in the large e region (e > 0.1). These results are consistent with Figure 15 



which shows the estimated mean and variance of the first eigenvector ip £> i for a few selected values 



of e (e — 0.02,0.03,0.1, 1), marked with blue circles in Figure 14 Figures 16 18 show similar 
results for the second, third and fourth eigenvectors ip 2 , ^3; ^4- Note that even in cases where the 
error in the estimates of the eigenvectors is large, the variance around the cross-over points (where 
the eigenvectors switch signs) can be small. 



Figure|l9|(left) shows a histogram of N = #{Xj : ||X—X,-|| < V2e} fore = 0.02, 0.03, 0.1, 1 
and n = 1000. All results are averaged over 500 independent simulations. The vertical dashed lines 
indicate the median values. For this particular example, we know that the error is small when e 



is between 0.03 and 0.1. This corresponds to median{A^!, . . . , N n } being around 100. Figure 19 
(right) shows a histogram of the distance to the A;-nearest neighbor for k = 100. The median value 
0.32 (see vertical dashed line) roughly corresponds to the tuning parameter e = 0.32 2 /2 = 0.05. 



Choosing the Bandwidth Using SNR. Figure 20 (line with circles) shows the signal-to-noise 
ratio for tpx with n = 1000, estimated by simulation. For each simulation, we also computed 
the bootstrap estimate of SNR and averaged this over the simulations. The result is the line with 



triangles. The dashed lines in Figure 21 represent bootstrap estimates of SNR for three typical data 
sets. The resulting using SNR = 5 are shown to the right. For all three data sets, the bootstrap 
estimates of ipi (dashed lines) almost overlap the true eigenvector (solid line). 



Estimating the Nodal Domain. Now consider estimating H e (x) = sign(?/v(x)). Figure 22 
shows the nodal domain error for H e when I = 1,2, 3, 4, estimated by simulation. We see that 
the error is relatively small and stable over e. As predicted by our results, large e can lead to very 
low error. We can use the instability measure B,(e,£) = F(Hi(X) ^ Hi(X)), where Hi(x) = 
sign(^ £j £(x)), to choose e and q. For example, find the smallest e and the largest number q of 
eigenvectors such that E(e, t) < a for all t < q. (In this case, a = 0.2 approximately corresponds 
to £ = 0.075 and q = 4.) We should caution the reader, however, that stability-based ideas have 
drawbacks. In clustering, for example, |Ben-David et al. (2006[ ) showed that choosing the number 
of clusters based on stability can lead to poor clusters. 

6.2 Words 

The last example is an application of SCA to text data mining. The example shows how one can 
measure the semantic association of words using diffusion distances, and how one can organize 
and form representative "meta-words" by eigenanalysis and quantization of the diffusion operator. 

The data consist of p = 1161 Science News articles. To encode the text, we extract n = 1004 
words (see |Lafon and Lee (20 06) for details) and form a document- word information matrix. The 



4 We discard simulations where A x = A = 1 for e = 0.02. 
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I 




Figure 14: The error \\tpi — tpe,i\\ m the estimate of the first eigenvector as a function of e. For each e (red dots), an 
average is taken over approximately 200 independent simulations with n = 1000 points from a mixture distribution 
with two Gaussians. Figure 15 shows the estimated mean and variance of ij] e i for e = 0.02, 0.03, 0.1, 1 (blue circles) 
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Figure 15: The first eigenvector ij) e i for e = 0.02, 0.03, 0.1, 1 and n — 1000. The red dashed curves with shaded 
regions indicate the mean value ± two standard deviations for approximately 300 independent simulations. The black 
solid curves show ip e> i as e — > 0. 

£=0.02 £=0.03 




-4 -2 2 4 -4 -2 2 4 

Figure 16: The second eigenvector ip St 2 for e = 0.02, 0.03, 0.1,1 and n = 1000. The red dashed curves with shaded 
regions indicate the mean value ± two standard deviations for approximately 300 independent simulations. The black 
solid curves show ip Et 2 as e — > 0. 
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e=0.03 





-2 2 4 



Figure 17: The third eigenvector ip s>3 for e = 0.02, 0.03, 0.1, 1 and n — 1000. The red dashed curves with shaded 
regions indicate the mean value ± two standard deviations for approximately 300 independent simulations. The black 
solid curves show ip e 3 as e — > 0. 



£=0.02 



£=0.03 




2 2 4 




2 2 4 



Figure 18: The fourth eigenvector ^p eA for e = 0.02, 0.03, 0.1, 1 and n = 1000. The red dashed curves with shaded 
regions indicate the mean value ± two standard deviations for approximately 300 independent simulations. The black 
solid curves show ip eA as s — > 0. 
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Figure 19: Left: Histogram of N t = #{Xj : \\X t - X 3 \\ < V%e} for e = 0.02,0.03,0.1, 1 and n = 1000. 
The vertical dashed lines indicate the median values. Right: Histogram of the distance to the fc-nearest neighbor for 
k = 100. The median value 0.32 (vertical dashed line) roughly corresponds to e = 0.32 2 /2 = 0.05. All results are 
averaged over 500 independent simulations. 



mutual information between document x and word y is defined as 



log 



2~^£ y Y2r) f$,V 

where f XjV = c Xjy /n, and c x>y is the number of times word y appears in document x. Let 

e y = [-^1,3/5 ^2,2/> • • • Ip,y\ ■ 

be a p-dimensional feature vector for word y. 

Our goal is to reduce both the dimension p and the number of variables n, while preserving the 
main connectivity structure of the data. In addition, we seek a parameterization of the words that 
reflect how similar they are in meaning. Diffusion maps and diffusion coarse-graining (quantiza- 
tion) offer a natural framework for achieving these objectives. 



Define the weight matrix K.(i,j) = exp 



ll e i- e jlh 

4e 



for a graph with n nodes. Let 



be 



the corresponding m-step transition matrix with eigenvalues A™ and eigenvectors ip£. Using the 



bootstrap, we estimate the SNR of ipx as a function of e (Figure 23 left). A SNR cut-off at 2, gives 
the bandwidth e = 150. Figure 23 , right, shows the spectral fall-off for this choice of e. For m = 3 
and q = 12, we have that (Ag/Ai)" 1 < 0.1, i.e we can obtain a dimensionality reduction of a factor 



of about 1/100 by the eigenmap e y G 



(XTMv), WMv), ■ ■ ■ , KMv)) e ra* without 



losing much accuracy. Finally, to reduce the number of variables n, we form a quantized matrix 



A em for a coarse-grained random walk on a graph with k < n nodes. It can be shown (Lafon 
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Figure 20: True signal-to-noise ratio estimated by simulation (rings) and mean of the bootstrap estimated signal-to- 
noise ratio (triangles) as a function of e. 
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Figure 2 1 : Left: Signal-to-noise ratio, as a function of e, estimated by simulation (solid black line), and by the 
bootstrap for three different data sets (dashed lines). Right: tpi (solid black line) and resulting bootstrap estimates of 
ipi using SNR = 5 (dashed lines). 



0.45 




Figure 22: The nodal domain error F(H e (X) ^ H e (X)) as a function of e for I = 1,2, 3, 4. 
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Figure 23: Left: Estimated SNR of ipi as a function of e. Right: Decay of the eigenvalues of A e-m for e = 150 and 
to = 1,2,3,4. 



and Lee, 2006), that the spectral properties of A e m and A e m are similar when the coarse-graining 



(quantization) corresponds to A;-means clustering in diffusion space. 

Figure|24]shows the first two diffusion coordinates of the cluster centers (the "meta- words") for 
k = 100. These representative words have roughly been rearranged according to their semantics 
and can be used as conceptual indices for document representation and text retrieval. Starting to the 
left, moving counter-clockwise, we here have words that express concepts in medicine, biology, 
earth sciences, physics, astronomy, computer science and social sciences. Table [2] gives examples 
of words in a cluster and the corresponding word centers. 



7 Discussion 

Spectral methods are rapidly gaining popularity. Their ability to reveal nonlinear structure makes 
them ideal for complex, high-dimensional problems. We have attempted to provide insight into 
these techniques by identifying the population quantities that are being estimated and studying 
their large sample properties. 

Our analysis shows that spectral kernel methods in most cases have a convergence rate similar 
to classical non-parametric smoothing. Laplacian-based kernel methods, for example, use the 
same smoothing operators as in traditional nonparametric regression. The end goal however is not 
smoothing, but data parameterization and structure definition of data. Spectral methods exploit 
the fact that the eigenvectors of local smoothing operators provide information on the underlying 
geometry and connectivity of the data. 

We close by briefly mention how SCA and diffusion maps also can be used in clustering, 
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Figure 24: Parameterization and grouping of words using diffusion mapping. The text labels the representative word 
centers (meta-words) in each group. Note that the words are roughly arranged according to their semantic meaning. 
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Table 2: Examples of word groupings 


VVtJlCl CtlllCl 


rvCllld.ll 11 llg, WUlLl?) ill giVJUU 


virus 


aius, diiergy, niv, vaccine, virai 


reprou uc li ve 


iruiL, iiiaie, oiispring, reproductive, sex, sperm 


VlLallllll 


cdiory, uriiiKing, icia, sugar, suppiemeiiL, vegeLaDie 


fever 


epidemic, leuicu, ouLDreaK, loxiii 


CLUay a LClll 


f^pnl r\ (Ti ct ti c n Trxrf^ct m an np n \'pr cm 1 trr\ni p nl 
cnjiugioL, iimi, luicaL, liiaiiiic, iivci, ouii, Liuui^ai 


warming 


climate, el, nino, forecast, pacific, rain, weather, winter 


geologic 


beneath, crust, depth, earthquake, plate, seismic, trapped, volcanic 


laser 


atomic, beam, crystal, nanometer, optical, photon, pulse, quantum, semiconductor 


hubble 


dust, gravitational, gravity, infrared 


galaxy 


cosmic, universe 


finalist 


award, competition, intel, prize, scholarship, student, talent, winner 



density estimation and regression. The full details of these applications will be reported in separate 
papers. 

7.1 Clustering 

One approach to clustering is spectral clustering. The idea is to reparameterize the data using the 
first few nontrivial eigenvectors ipi, . . . , ip m and then apply a standard clustering algorithm such as 
A>means clustering. This can be quite effective for finding non-spherical clusters. 

Diffusion distances can also be used for this purpose. Recall that ordinary A;-means clustering 
seeks to find points C = {c\, . . . , c^} to minimize the empirical distortion 

1 n 

A(C) = -J2\\X l -Qc(X t )\\ 2 

i=l 

where Qc is the quantization map that takes Xj to the closest Cj E C. The empirical distortion 
estimates the population distortion 

A(C)=E||X-Q C (X)|| 2 . 
By using /c-means in diffusion map coordinates we instead minimize 

A(C)=ED t (X,Q t (X)) 



where Q t (x) = argmin cgC .D i (x, c). The details are in Lee and Wasserman (2008 1. 



7.2 Density Estimation 



If Q is a quanization map then the quantized density estimator (Meinicke and Ritter, 2002) is 

n 

m = -Y,^ Kh{h ~ Q{m) - 

i=l 
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For highly clustered data, the quantized density estimator can have smaller mean squared error 
than the usual kernel density estimator. Similarly, we can define the quantized diffusion density 
estimator as 



=1 



which can potentially have small mean squared error for appropriately chosen t. See Buchman 



et al. (2008 ) for an application to density estimation of hurricane tracks in the Atlantic Ocean. 



7.3 Regression 

A common method for nonparametric regression is to expand the regression function m(x) = 
E(Y\X = x) in a basis and then estimate the coefficients of the expansion from the data. Usually, 
the basis is chosen beforehand. The diffusion map basis provides a natural data-adaptive basis for 
doing nonparametric regression. We expand m(x) = K(Y\X = x) as m{ x) = (3jipj{x). Let 

™( x ) — Y^j=i Pj^Pejix) where q and e are chosen by cross-validation. See Richards et al. (2009) 
for an application to astronomy data and spectroscopic redshift prediction. 



8 Appendix 

8.1 Spectral Decomposition and Euclidean Distances in Diffusion Space 

In this section, we describe how a symmetric operator A, the stochastic differential operator A 
and its adjoint (the Markov operator) A* are related, and how these relations lead to different 
normalization schemes for the corresponding eigenvectors. (For ease of notation, we have omitted 
the subindex e, since we here consider a fixed e > 0.) We also show that the diffusion metric 
corresponds to a weighted Euclidean distance in the embedding space induced by the diffusion 
map. 

Suppose that P is a probability measure with a compact support X. Let k : X x X be a 
similarity function that is symmetric, continuous, and positivity-preserving, i.e. k(x,y) > for all 
x, y 6 X. For simplicity, we assume in addition that k is positive semi-definite, i.e. for all bounded 
functions / on X, J x f x k(x, y)f(x)f(y)dP(x)dP(y) > 0. Consider two different normalization 
schemes of k: 

a(x, y) = J^>y)^ = (symmetric) 

a(x,y) = (stochastic) 

where p(x) = J k(x,y)dP(y). 

Define the symmetric integral operator A by 

Af(x)= [ a(x,y)f(y)dP{y). 
J x 
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Under the stated conditions, k(x,y) is an L 2 -kernel. It follows that A is a self-adjoint compact 
operator. The eigenvalues {\e}e>o of A are real and the associated eigenfunctions {ve}e>o form an 
orthonormal basis of L 2 {X; dP). According to Mercer's theorem, we have the spectral decompo- 
sition 

ofo v)=^2 ^iVi{x)v t (y), ( 40 ) 

where the series on the right converges uniformly and absolutely to a(x, y). 

Now consider the integral operator A and its adjoint (the Markov operator) A*: 

Af{x) = f x a{x,y)f(y)dP{y) 
A* fix) = J x f(y)a(y,x)dP(y), 

where (Af,g) L 2 {x . dP) = (/, A* g) L 2 {x . dP) . Let s(x) = p{x)/ f p(y)dP(y). If Av t = X e v e , then 
we have the corresponding eigenvalue equations 

Aip £ = \ e ip e , where ip e (x) = v t {x)/yjs{x) (41) 

and 

A*(p t = X e <pi, where <p e (x) = Vi(x)y/s(x). (42) 

Moreover, if {ve}i>o is an orthonormal basis of L 2 (X;dP), then the sets {ipi}e>o and {y>t}t>a 
form orthonormal bases of the weighted L 2 -spaces L 2 (X; sdP) and L 2 (X;dP/s), respectively. 
The operator A preserves constant functions, i.e. Al = 1. One can also show that the matrix 

norm \\A\\ = ^Vf^L 2 {x-,dP) yur = 1- Thus, the eigenvalue A = 1 is the largest eigenvalue of 
the operators A and A*. The corresponding eigenvector of A is ipo = 1, and the corresponding 
eigenvector of A* is <p = s. 



From Eq. 40 it follows that 

<*(z,y) = ^2Xiipi(x)ipi(y), 

£>0 

where \\(p£\\ L \X;dP/s) = IMl^x-mp) = 1 for a11 ^ ^ °' and ^t)ti>{X;dP) = for k ^ t. More 
generally, if a m (x, y) is the kernel of the m th iterate A m , where m is a positive integer, then 

a m {x, y) = J2 X 7M%)¥i(y)- (43) 

We define a one-parametric family of diffusion distances between points x and z according to 

D m(x,z) = \\a m (x,-) -a m (z,-)\\ 2 L2{x . dP/s) , (44) 

where the parameter m determines the scale of the analysis. The diffusion metric measures the 
rate of connectivity between points on a data set. It will be small if there are many paths of lengths 
less than or equal to 2m between the two points, and it will be large if the number of connections 



is small. One can see this clearly by expanding the expression in Eq. 44 so that 



a 2m (x } x) a 2m (z,z) fa 2m (x,z) a 2m (z } x) 
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The quantity D^x, z) is small when the transition probability densities a 2m {x, z) and a 2rn (z, x) 
are large. 

Finally, we look for an embedding where Euclidean distances reflect the above diffusion met- 
ric. The biorthogonal decomposition in Eq. [43] can be viewed as an orthogonal expansion of the 
functions a m (x, ■) with respect to the orthonormal basis {(pe}t>o of L 2 (X; dP/s); the expansion 
coefficients are given by {\™i)t(x)}e> Q . Hence, 

D 2 m ( X ,z) = J2( X ?M%) ~ K l Mz)Y = \\*m(x) ~ * m («)|| a , 

where \I/ m : x i— ► (A™V'i(x), A™^ 2 (^), • • •) is the diffusion map of the data at time step m. 



8.2 Proofs 

Proof of Theorem [1} From Theorem 2 below, we have that 



\Me n ,P n )-A t \\ =[0 P [ xl ^ili 1 +0(e n ) ) x p(t). 



Hence, 

\\A t {e n ,q,P n ) - A t \\ < \\At(e n ,q,P n )-At(e n ,P n )\\ + \\A t (e n ,P n )-A t \ 



\A t (e n ,q,P n ) - A t (e n ,P n )\\ + [O p [ J ^Sffi I + °( £ n) ) x p(t) 



+ 0{e n ) x p{t) 



q+l \ \ V n£r 

Now we bound the first sum. Note that, 

1^-2 2 I A E | — A £ ^ A £ — y4 e 
SU P K,/ - = SU P < ~ = °p(7n) 

where 

'log(l/e n ) 



In 

By a Taylor series expansion, 



G E J = Gf + 0(e n ) 

uniformly for / £ T . (This is the same calculation used to compute the bias in kernel regression. 



See also, Gine and Koltchinskii (2006 ) and Singer (2006)). So 



SU P We n ,£ ~ v t I < l|G en - G|| = 0(e r 
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Therefore, 



Eft 

9+1 



Ed 

9+1 

exp | — log(l - £n^) | 

9+1 L £n > 

exp { — log(l - £ n [Op(7n) + 0{e n ) - ) 

9+1 L £n > 

oo 

(l + P ( Tn )+0(£ n ))^e-^. 

9+1 



In conclusion, 

\\A t {e n ,q,P n ) - A t | 



0/ 



'log(l/e r 



(d+4)/2 



0(e«) x 



+ [l + P ( 7rl )+0(£n))£ e ^ 
^ ' 9+1 



ne 



9+1 



Proof of Theorem 2} Recall that A t (£ n , P n ) = e i(A£ ™~ 7)/£ ™. From lemma 1, ||A e -A E 
where 



7(e) = Op 



'log(l/e r 



d/2 



Hence, 



A e -I A £ -A e A £ -I 7 (e) 



and so 

A t (e,P„) = A t e i(7(£)+ ° (£2)/£ = A 
Therefore, 

||A t -A t (e,P n )ll = 



+ G + 0(e) 



I + t( 7 (e) + 0{e 2 ))/e + (t( 7 (e) + 0(e)))/e 



< o, 



n£ (d+4)/2 

l0g(l/gn) 
n£ (d+4)/2 



+ 0(e) ^V"*'. □ 



£=1 
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Lemma 1 Let e n — > and nen 2 / log(l/e n ) — > oo. 77zen 

'log(l/e n ) 



L4 e -AJ =0 P 



d/2 

neJ 



Proof. Uniformly, for all f E J 7 , and all x in the support of P, 

\AJ(x) - AJ(x)\ < \AJ(x) - AJ(x)\ + \AJ(x) - AJ(x) 



where A £ f(x) = J a £ (x, y)f(y)dP(y). From Gine and Guillou (2002), 



\%{x) -p £ (x)\ _ ( / log(l/e w ) 
Hence, 

|AJ(x)-2j(x)| < \P0^M^ f \f( y )\k £ (x,y)dP(y) 

\Pe{x)p £ (x)\ J 

'log(l/e n )" 

a 



QpU/ m J /2 7 I / |/(y)IM*,y)dP(!/) 



Next, we bound A £ f(x) — A £ f(x). We have 



'l og(l/e n ) 

d/2 
UErl 



AJ(x) - AJ(x) = I f{y)a £ {x,y){dP n {y) - dP(y)) 

1 



f(y)k £ (x,y)(dP n (y)-dP(y)). 



p(x) + o P (l) 

Now, expand /(y) = /(x) + r n (y) where r n (y) = (y — x) T V f{u y ) and w y is between y and x. So, 

f(y%{x,y)(dP n (y)-dP(y))=f{x) J k e {x,y)(dP n {y)-dP(y))+ J r n {y)k £ (x,y)(dP n (y)-dP(y)) 

By an application of Talagrand's inequality to each term, as in Theorem 5.1 of Gine and Koltchin- 
skii (2006), we have 



f(y)k £ {x,y)(dP n {y)-dP{y))=0 P 



'l og (!/£„ .) 
d/2 

nen 



Thus, 

sup \\A £ f-A £ fU = P (J^^ ) 
f& \ V net 
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This also holds uniformly over {/ e T : \\f\\ = 1}. Moreover, \\A £ f-AJ\\ 2 < C\\AJ - AJ\\ 
for some C since P has compact support. Hence, 

\\AJ-AJ\\ 2 2 ( j \og{l/e n ) \ 

SUP U7\\ = SUP W A ef ~ A efh = P \ J jj^- .□ 

f&F ll/H /e^ll/ll=i VV net! J 



Proof of Theorem [3) Let A n = {\ipi{X)\ < 6 n }. Then 

A c n f}{H(X) ± implies that - ^ X {X)\ > <5 r , 

Also, sup^, \^)\{x) — V'e.i^)! < c£ n f° r some c > 0. Hence, 

P (H(X) ± H(X)) = f(h(X)^H(X),A^+f(h(X)^H(X),A^ 

< F(A n )+F^H(X)^H(X),A c n ) 

< C6Z + f(\MX)-&,i(X)\ >^n) 

< C5Z + f(\MX) -MX)\ + \i/; e>1 (X)-^ etl (X)\ > <f n ) 

< Ctf£ + P - ^ e ,i(X)| + ce n > 5 n ) 

= Ctf£ + P (\$ e ,i(X) - 1p e ,l(X)\ >5 n - C£„) 

- + T^e 

< C5 a n +0 P 



log(l/e n ) \ 1 



72£ 



(d+4)/2 / X - C P 



Set 5 = 2ce n and e n = n -V(^+d+8) and so 



P (#(X) ^ (X)) 



2a 



< J7, 4a+8+d . □ 
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